Exporting Data from MongoDB to GCS Buckets using Dataproc Serverless

Hitesh Hasija
Google Cloud - Community
3 min read · Sep 17, 2022


Apache Spark is usually the first choice whenever in-memory data processing is concerned. On GCP, however, Spark has traditionally come with the maintenance cost of Dataproc clusters, and this overhead of maintaining a Spark cluster becomes an obstacle when using Spark for new jobs. Google Cloud's Dataproc Serverless offering allows us to run Spark jobs without worrying about the overhead of maintaining a Dataproc / Spark cluster.

Dataproc Serverless can be used for running various kinds of Spark jobs. One of the major use cases involves importing and exporting data via Google Cloud Storage (GCS) buckets.

MongoDB is a well-known NoSQL, document-oriented database. Exporting and importing data between MongoDB and cloud storage is a very common use case. This article covers one such scenario: exporting data from MongoDB to GCS buckets using the Dataproc Serverless approach.

These days many companies use MongoDB Atlas (MongoDB's managed service across different cloud providers) instead of managing a MongoDB cluster on their own. Hence, this article also covers connecting to MongoDB Atlas with Dataproc Serverless.

Key Benefits

  1. These templates are open source and can be used by anyone for their workload migration.
  2. These templates are customisable: the GitHub repository can easily be cloned and adapted to your requirements with the corresponding code changes.
  3. Dataproc Serverless frees the developer from the headache of managing a Dataproc cluster.
  4. Supported file formats are JSON, Avro, Parquet and CSV.
  5. These templates are configuration driven and can be reused for similar use cases very easily by just changing the connection parameters.

Usage

1. Create a GCS bucket and staging location for jar files.
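
For example, the buckets can be created with gsutil and the MongoDB connector jars copied into a jar location in GCS (bucket names, region and jar paths below are illustrative, not prescriptive):

# Create the output bucket and the staging bucket (illustrative names)
gsutil mb -l us-central1 gs://GCS_Bucket_Name
gsutil mb -l us-central1 gs://staging-bucket
# Upload the MongoDB connector jars referenced later via the JARS variable
gsutil cp mongo-java-driver-3.9.1.jar mongo-spark-connector_2.12-2.4.0.jar gs://jar_location/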

2. Clone the git repo in Cloud Shell, which comes pre-installed with the required tools. Alternatively, use any machine with JDK 8+, Maven and Git installed.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/python

3. Obtain authentication credentials (to submit the job).

gcloud auth application-default login

4. Execute the MongoToGCS template.
E.g. when connecting to a self-hosted MongoDB instance:

export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://staging-bucket
export JARS="gs://jar_location/mongo-java-driver-3.9.1.jar,gs://jar_location/mongo-spark-connector_2.12-2.4.0.jar"
./bin/start.sh \
-- --template=MONGOTOGCS \
--mongo.gcs.output.format="avro" \
--mongo.gcs.output.location="gs://GCS_Bucket_Name/mongogcsoutput" \
--mongo.gcs.output.mode="overwrite" \
--mongo.gcs.input.uri="mongodb://1.2.3.45:27017" \
--mongo.gcs.input.database="demo" \
--mongo.gcs.input.collection="analysis"

E.g. when connecting to MongoDB Atlas:

export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://staging-bucket
export JARS="gs://jar_location/mongo-java-driver-3.9.1.jar,gs://jar_location/mongo-spark-connector_2.12-2.4.0.jar"
./bin/start.sh \
-- --template=MONGOTOGCS \
--mongo.gcs.output.format="avro" \
--mongo.gcs.output.location="gs://GCS_Bucket_Name/mongogcsoutput" \
--mongo.gcs.output.mode="overwrite" \
--mongo.gcs.input.uri="mongodb+srv://<username>:<password>@hhasija.wnopa.mongodb.net" \
--mongo.gcs.input.database="demo" \
--mongo.gcs.input.collection="analysis"
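
Once the batch job finishes, the exported Avro files can be verified under the configured output location (the bucket name below is the illustrative one used above):

gsutil ls gs://GCS_Bucket_Name/mongogcsoutput/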

NOTE: It will ask you to enable the Dataproc API, if it is not enabled already.
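
If you prefer to enable the Dataproc API up front, it can be enabled with gcloud (the project ID below is illustrative):

gcloud services enable dataproc.googleapis.com --project=my-gcp-project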

Schedule the batch job

GCP natively provides Cloud Scheduler + Cloud Functions, which can be used to submit Spark batch jobs on a schedule. Alternatively, self-managed tools such as Linux crontab or Jenkins can be used as well.
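
As a minimal sketch of the self-managed route (the wrapper script name, paths and schedule below are hypothetical), a crontab entry could re-run the export every night:

# Illustrative crontab entry: run the MongoToGCS export daily at 02:00.
# run_mongo_to_gcs.sh is a hypothetical wrapper that exports GCP_PROJECT,
# REGION, GCS_STAGING_LOCATION and JARS, then calls ./bin/start.sh as above.
0 2 * * * /opt/dataproc-templates/python/run_mongo_to_gcs.sh >> /var/log/mongo_to_gcs.log 2>&1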

Setting additional Spark properties

In case you need to specify Spark properties supported by Dataproc Serverless, such as adjusting the number of drivers, cores, executors, etc., you can edit the OPT_PROPERTIES values in the start.sh file.
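
As an illustration, assuming OPT_PROPERTIES is passed through to the batch submission as a --properties flag (the values below are examples only, not tuning recommendations), the edited line might look like this:

# Illustrative Spark tuning properties for Dataproc Serverless
OPT_PROPERTIES="--properties=spark.driver.cores=4,spark.executor.cores=4,spark.executor.instances=10"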

References

https://medium.com/google-cloud/importing-data-from-gcs-to-databases-via-jdbc-using-dataproc-serverless-7ed75eab93ba

https://github.com/GoogleCloudPlatform/dataproc-templates

https://medium.com/google-cloud/importing-data-from-gcs-to-mongodb-using-dataproc-serverless-fed58904633a
